[0005] Improved performance over conventional SISD CPUs may be achieved by building 
systems which exhibit parallel processing capability. Typically, parallel processing systems 
use multiple processing units or processing elements to simultaneously perform one or more 
tasks on one or more data streams. For example in one class of parallel processing system, 
the results of an operation from a first CPU are passed to a second CPU for additional 
processing, and from the second CPU to another CPU, and so on. Such a system, commonly 
known as a "pipeline", is referred to as a multiple-instruction, single-data or MISD system 
because each CPU receives a different instruction stream while operating on a single data 
stream. Improved performance may also be obtained by using a system which contains many 
autonomous processors, each running its own program (even if the program running on the 
processors is the same code) and producing multiple data streams. Systems in this class are 
referred to as a multiple-instruction, multiple-data or MIMD system. 
[0006] Additionally, improved performance may be obtained using a system which has 
multiple identical processing units each performing the same operations at once on different 
data streams. The processing units may be under the control of a single sequencer running a 
single program. Systems in this class are referred to as a single-instruction, multiple data or 
SIMD system. When the number of processing units in this type of system is very large (e.g., 
hundreds or thousands), the system may be referred to as a massively parallel SIMD system. 
[0007] Nearly all computer systems now exhibit some aspect of one or more of these types 
of parallelism. For example, MMX extensions are SIMD; multiple processors (graphics 
processors, etc.) are MIMD; pipelining (especially in graphics accelerators) is MISD. 
Furthermore, techniques such as out of order execution and multiple execution units have 
been used to introduce parallelism within conventional CPUs as well. 
[0008] Parallel processing is also used in active memory applications. An active memory 
refers to a memory device having a processing resource distributed throughout the memory 
structure. The processing resource is most often partitioned into many similar processing 
elements (PEs) and is typically a highly parallel computer system. By distributing the 
processing resource throughout the memory system, an active memory is able to exploit the 
very high data bandwidths available inside a memory system. Another advantage of active 
memory is that data can be processed "on-chip" without the need to transmit the data across a 
system bus to the CPU or other system resource. Thus, the work load of the CPU may be 
reduced to operating system tasks, such as scheduling processes and allocating system 
resources. 

[0009] A typical active memory includes a number of interconnected PEs which are capable 
of simultaneously executing instructions sent from a central sequencer or control unit. The 
PEs may be connected in a variety of different arrangements depending on the design 
requirements for the active memory. For example, PEs may be arranged in hypercubes, 
butterfly networks, one-dimensional strings/loops, and two-dimensional meshes, among 
others. 

[0010] A typical PE may contain data, for example a set of values, stored in one or more 
registers. In some instances, it may be desirable to determine the extrema (e.g., the highest or 
lowest value) of the set of values on an individual PE. Furthermore, it may be desirable to 
find the extrema for an entire array of PEs. Conventional methods for finding the extrema, 
however, often result in a number of processing cycles being "lost." A lost cycle may refer to, 
for example, a cycle in which the PE must wait to complete a calculation because the 
necessary data has yet to be transferred into or out of the PE. 

[0011] One approach for finding the global extrema of a set of shorts (i.e., a "short" refers to 
a 16-bit value) for an array of 8-bit processors transmits the bytes in the order in which they 
are needed for comparison in the PE. The 8-bit PE processes each short as two separate 
bytes, a "most significant" (MS) byte and a "least significant" (LS) byte. Once started, for 
continuous operation, this approach requires a further four (4) cycles per short. First, the local 
LS-byte of the needed short is loaded onto the network during the first clock pulse and 
transferred to the PE during the second clock pulse. Next, the local MS-byte of the needed 
short is loaded onto the network during the third clock pulse and transferred to the PE during 
the fourth clock pulse. As can be seen, four (4) cycles are required to transfer the needed 
short to the PE. Thus to transfer sixteen (16) shorts, sixty-four (64) cycles are required. 
[0012] Also, two (2) cycles are required for the PE to compare one short to another short. 
For example, the LS-byte of short-1 is compared to the LS-byte of short-2 in a first cycle and 
the MS-byte of short-1 is compared to the MS-byte of short-2 in a second cycle. For sixteen 
(16) values, fifteen (15) comparisons are required. Thus of the total sixty-four (64) cycles, the 
PE is "working" a minimum of thirty (30) cycles and is idle for thirty-four (34) cycles. 
Accordingly, this approach is considered to have a "transfer bottleneck" because the idle 
cycles are caused by the way the bytes are transferred. 
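
The cycle accounting for this first approach can be checked with a short calculation. The following Python sketch is illustrative only; the function and constant names are hypothetical and simply encode the figures quoted above (four transfer cycles per short, two compare cycles per pair of shorts).

```python
# Illustrative cycle accounting for the first approach (names are hypothetical).
TRANSFER_CYCLES_PER_SHORT = 4   # load LS, move LS, load MS, move MS
COMPARE_CYCLES_PER_PAIR = 2     # one cycle for the LS bytes, one for the MS bytes


def first_approach_cycles(num_shorts=16):
    """Return total, busy, and idle PE cycles for a set of shorts."""
    total = num_shorts * TRANSFER_CYCLES_PER_SHORT
    # N values need N - 1 comparisons to find an extrema.
    pe_busy = (num_shorts - 1) * COMPARE_CYCLES_PER_PAIR
    return {"total": total, "pe_busy": pe_busy, "pe_idle": total - pe_busy}


counts = first_approach_cycles()
print(counts)
```

For sixteen shorts this gives 64 total cycles, of which the PE compares during 30 and waits during 34.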

[0013] A second approach attempts to minimize the time required to transfer the shorts by 
first transferring all of the LS-bytes to the PE and then transferring all of the MS bytes to the 
PE. Once started, for continuous operation, this approach requires approximately 3 cycles per 
short. For example for sixteen (16) PEs each having one local short, sixteen (16) cycles are 
needed to transfer each short's LS-byte to each other PE and to collect the sixteen (16) LS 
bytes in the PE's register files. An additional sixteen (16) cycles are then needed to transfer 
each short's MS-byte to each other PE and to start comparing the shorts to each other. 
Another fifteen (15) cycles are needed to finish comparing the shorts. It should be noted that 
the PE cannot start comparing the shorts until the first MS-byte is transferred. After the first 
MS-byte is transferred, the PE requires 30 cycles to finish comparing all sixteen (16) shorts. 
Thus of the forty-six (46) total cycles, the transfer network is working for thirty-two (32) cycles 
and is idle for fourteen (14) cycles. Accordingly, this approach is considered to have a 
"processing bottleneck" because the idle cycles are caused by the way the bytes are 
processed. It should be noted that each of the approaches discussed above may also require 
additional cycles for initialization and termination of the process. 
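
The figures for this second approach can be reproduced in the same way. The sketch below is a hedged illustration with hypothetical names. It assumes, as the paragraph above states, that comparison cannot begin until the first MS-byte arrives and that the PE then needs two cycles for each of the fifteen comparisons.

```python
# Illustrative cycle accounting for the second approach (names are hypothetical).
def second_approach_cycles(num_pes=16):
    """Return total cycles and transfer-network busy/idle cycles."""
    ls_phase = num_pes                 # cycles to circulate every LS byte
    ms_phase = num_pes                 # cycles to circulate every MS byte
    compare = 2 * (num_pes - 1)        # PE work once the first MS byte is available
    total = ls_phase + compare         # comparing overlaps the MS transfer phase
    network_busy = ls_phase + ms_phase
    return {
        "total": total,
        "network_busy": network_busy,
        "network_idle": total - network_busy,
    }


counts = second_approach_cycles()
print(counts, counts["total"] / 16)
```

For sixteen PEs the total is 46 cycles (about 2.9 cycles per short), with the transfer network busy for 32 and idle for 14.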

[0014] Each of the approaches discussed above has idle or "lost" cycles. Thus, there exists 
a need for a method for determining the extrema of a set of values on an array of parallel 
processors such that the resources of the parallel processing system are maximized. More 
specifically, there exists a need for a method for determining the extrema of a set of values on 
an array of parallel processing elements of an active memory such that the resources of the 
active memory are maximized. 

SUMMARY OF THE INVENTION 

[0015] One aspect of the present invention relates to a method for finding an extrema for an 
n-dimensional array having a plurality of processing elements comprising determining within 
each of the processing elements a first dimensional extrema for a first dimension of the n- 
dimensional array, wherein the first dimensional extrema is related to one or more local 
extrema of the processing elements in the first dimension and wherein the first dimensional 
extrema has a most significant byte and a least significant byte, determining within each of 
the processing elements a next dimensional extrema for a next dimension of the n- 
dimensional array, wherein the next dimensional extrema is related to one or more of the first 
dimensional extrema and wherein the next dimensional extrema has a most significant byte 
and a least significant byte, and repeating the determining within each of the processing 
elements a next dimensional extrema for each of the n-dimensions, wherein each of the next 
dimensional extrema is related to a dimensional extrema from a previously selected 
dimension. 

[0016] Another aspect of the present invention relates to a method for identifying extrema 
within a data stream as having one of an odd or an even position, the extrema having a most 
significant byte and a least significant byte, processing the extrema having an odd position to 
produce an odd extrema, the odd extrema having a most significant byte and a least 
significant byte, processing the extrema having an even position to produce an even extrema, 
the even extrema having a most significant byte and a least significant byte, and determining a 
dimensional extrema from the odd extrema and the even extrema, the dimensional extrema 
having a most significant byte and a least significant byte. 
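
The odd/even partitioning described in this aspect may be sketched in software as follows. This is a minimal illustration only, not the claimed method: the function name and the sample values are hypothetical, positions are counted from one, and the stream is assumed to hold unsigned 16-bit values.

```python
# Minimal sketch of the odd/even split (hypothetical names and sample data).
def dimensional_extrema(stream, find_max=True):
    """Reduce odd- and even-positioned values separately, then combine them.

    Returns the extrema together with its most and least significant bytes.
    """
    pick = max if find_max else min
    odd_extrema = pick(stream[0::2])    # positions 1, 3, 5, ... (1-based)
    even_extrema = pick(stream[1::2])   # positions 2, 4, 6, ... (1-based)
    result = pick(odd_extrema, even_extrema)
    return result, (result >> 8) & 0xFF, result & 0xFF


sample = [0x0102, 0x7F00, 0x00FF, 0x7F01]   # illustrative 16-bit values
value, ms_byte, ls_byte = dimensional_extrema(sample)
print(hex(value), hex(ms_byte), hex(ls_byte))
```

Splitting the stream this way yields two independent partial results, which are then combined in a single final comparison.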
[0017] Another aspect of the present invention relates to a method for determining a 
dimensional extrema for an n-dimensional array of processing elements. The method 
comprises loading odd numbered extrema from a set of the processing elements in a first 
dimension into a first plurality of registers, loading even numbered extrema from the set of 
processing elements into a second plurality of registers, comparing certain of the loaded odd 
numbered extrema to produce an odd extrema, the odd extrema having a most significant byte 
and a least significant byte, comparing certain of the loaded even numbered extrema to 
produce an even extrema, the even extrema having a most significant byte and a least 
significant byte, and producing a dimensional extrema in response to the odd extrema and the 
even extrema, the dimensional extrema having a most significant byte and a least significant 
byte. 

[0018] The present invention enables the multi-byte extrema of a set of values distributed 
across an array of parallel processors to be determined while maximizing the resources of the 
parallel processing system. More specifically, the least significant bytes and the most 
significant bytes of data for the set of values are distributed in bursts to reduce the amount of 
[...], among others. The present invention may be [...] 
software (i.e., the local processing capability) of each PE within the array. Those advantages 
and benefits, and others, will become apparent from the description of the invention below. 

BRIEF DESCRIPTION OF THE DRAWINGS 

[0019] To enable the present invention to be easily understood and readily practiced, the 
present invention will now be described for purposes of illustration and not limitation in 
connection with the following figures, wherein: 

[0020] FIG. 1 is a block diagram illustrating an active memory according to an embodiment 
of the present invention. 

[0021] FIG. 2 is a block diagram of a processing element for the active memory illustrated in 
FIG. 1 according to an embodiment of the present invention. 

[0022] FIG. 3 is a more detailed illustration of the processing elements of FIG. 2 according 
to an embodiment of the present invention. 

[0023] FIG. 4 illustrates an operational process for determining a global extrema for an array 
of processing elements according to an embodiment of the present invention. 

[0024] FIG. 5 is an operational process for determining an extrema of a single dimension of 
an n-dimensional array of processing elements according to an embodiment of the present 
invention. 

[0025] FIGS. 6a - 6h graphically represent the operational process of FIG. 5 as applied to a 
single line of the array 28 illustrated in FIG. 7 according to an embodiment of the present 
invention. 

[0026] FIG. 7 illustrates processing elements of FIG. 2 arranged in a loop-connected two- 
dimensional array according to an embodiment of the present invention. 

DETAILED DESCRIPTION OF THE INVENTION 

[0027] As discussed above, parallel processing systems may be placed within one or more 
classifications (e.g., MISD, MIMD, SIMD, etc.). For simplicity, the present invention is 
discussed in the context of a SIMD parallel processing system. More specifically, the present 
invention is discussed in the context of a SIMD active memory. It should be noted that such 
discussion is for clarity only and is not intended to limit the scope of the present invention 
in any way. The present invention may be used for other types and classifications of parallel 
processing systems. 

[0028] FIG. 1 is a block diagram illustrating an active memory 10 according to an 
embodiment of the present invention. It should be noted that the active memory 10 is only 
one example of a device on which the methods of the present invention may be practiced and 
those of ordinary skill in the art will recognize that the block diagram of FIG. 1 is an 
overview of an active memory device 10 with a number of components known in the art being 
omitted for purposes of clarity. 

[0029] Active memory 10 is intended to be one component in a computer system. 
Processing within active memory 10 is initiated when the active memory 10 receives 
commands from a host processor (not shown), such as the computer system's CPU. A 
complete processing operation (i.e., data movement and processing) in the active memory 10 
may consist of a sequence of many commands from the host to the active memory device 10. 
[0030] Active memory 10 is comprised of a host memory interface ("HMI") 12, a bus 
interface 14, a clock generator 16, a task dispatch unit ("TDU") 18, a DRAM control unit 
("DCU") 20, a DRAM module 22, a programmable SRAM 24, an array control sequencer 26, 
and a processing element array 28, among others. 

[0031] The HMI 12 provides an input/output channel between the host (such as a CPU, not 
shown) and the DRAM module 22. In the current embodiment, the HMI 12 receives 
command (cmd), address (addr), and data signals (among others) from and sends data and 
ready (rdy) signals (among others) to the host. The HMI 12 approximates the operation of a 
standard non-active memory so that the host, without modifications, is compatible with the 
active memory 10. 






[0032] The HMI 12 may be similar in its operation to the interface of a synchronous DRAM 
as is known in the art. Accordingly, the host must first activate a page of data to access data 
within a DRAM module 22. In the current embodiment, each page may contain 1024 bytes of 
data and there may be 16,384 pages in all. Once a page has been activated, it can be written 
and read through the HMI 12. The data in the DRAM module 22 may be updated when the 
page is deactivated. The HMI 12 also sends control signals (among others) to the DCU 20 
and to the processing element array 28 via the task dispatch unit 18. 

[0033] The HMI 12 may operate at a frequency different from that of the 
master clock. For example, a 2x internal clock signal from clock generator 16 may be used. 
Unlike a traditional DRAM, the HMI 12 uses a variable number of cycles 
to complete an internal operation, such as an activate or deactivate. Thus, the ready signal 
(rdy) is provided to allow the host to detect when a specific command has been completed. 
[0034] The bus interface 14 provides an input/output channel between the host and the 
TDU 18. For example, the bus interface 14 receives column select (cs), write command (w), 
read command (r), address (addr), and data signals (among others) from and places interrupt 
(intr), flag, and data signals (among others) onto the system bus (not shown). The bus 
interface 14 also receives signals from and sends signals to TDU 18. 

[0035] The clock generator 16 is operable to receive an external master clock signal (xl) and 
operable to provide the master clock signal (xl) and one or more internal clock signals (x2 
x4, x8) to the components of the active memory. It should be apparent to one skilled in the 
art that other internal clock signals may be produced by the clock generator 16. 
[0036] The TDU 18 communicates with the bus interface 14, the HMI 12, the programmable 
SRAM 24, the array control sequencer 26, and the DCU 20. In the current embodiment, the 
TDU 18 functions as an interface to allow the host to issue a sequence of commands to the 
array control sequencer 26 and the DCU 20. Task commands from the host may be buffered 
in the TDU's FIFO buffers to allow a burst command to be issued. Commands may contain 
information on how the tasks in the array control sequencer 26 and the DCU 20 should be 
synchronized with one another, among others. 

[0037] The DCU 20 arbitrates between the TDU 18 and the HMI 12 and sends commands to 
the DRAM modules 22 and the processing element array 28. The DCU 20 also schedules 
refreshes within the DRAM modules 22. In one embodiment, the DRAM modules 22 of the 
active memory 10 may be comprised of sixteen 64k x 128 eDRAM (or embedded DRAM) 
cores. Each eDRAM core may be connected to an array of sixteen PEs, thus requiring 256 
(16 x 16) PEs in all. 
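
As a consistency check on the sizes quoted in this description, the short Python calculation below compares the page-based figures (1024-byte pages, 16,384 pages) with the core-based figures (sixteen 64k x 128 cores). It is a sketch under two assumptions that the text does not state explicitly: that "64k" means 65,536 words of 128 bits each, and that both sets of figures describe the same memory. The constant names are hypothetical.

```python
# Hedged arithmetic check of the memory figures (names are hypothetical).
PAGE_BYTES = 1024
NUM_PAGES = 16_384
NUM_CORES = 16
WORDS_PER_CORE = 64 * 1024    # assumption: "64k" = 65,536 words
BITS_PER_WORD = 128
PES_PER_CORE = 16

capacity_from_pages = PAGE_BYTES * NUM_PAGES
capacity_from_cores = NUM_CORES * WORDS_PER_CORE * BITS_PER_WORD // 8
total_pes = NUM_CORES * PES_PER_CORE

print(capacity_from_pages, capacity_from_cores, total_pes)
```

Under these assumptions both calculations give 16,777,216 bytes (16 MiB), and the PE count is 256.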

[0038] The programmable SRAM 24 functions as a program memory by storing commands 
issued by the TDU 18. For example, the TDU 18 may transmit a "write program memory 
address" command which sets up a start address for a write operation and a "write program 
memory data" command which writes a memory location and increments the program 
memory write address, among others. The programmable SRAM 24, in the current 
embodiment, has both an address register and a data output register. 

[0039] The array control sequencer 26 is comprised of a simple 16-bit minimal instruction set 
computer ("16-MISC"). The array control sequencer 26 communicates with the TDU 18, the 
programmable SRAM 24, and the DCU 20, and is operable to generate register file addresses 
for the processing element array 28 and operable to sequence the array commands, among 
others. 

[0040] The processing element array 28 is comprised of a multitude of processing elements 
("PEs") 30 (see FIG. 2) connected in a variety of different arrangements depending on the 
design requirements for the processing system. For example, processing units may be 
arranged in hypercubes, butterfly networks, one-dimensional strings/loops, and two- 
dimensional meshes, among others. For discussion of the current embodiment, the PEs 30 are 
arranged in a 16x16, 2-dimensional loop-connected array (see FIG. 7). 
[0041] The processing element array 28 communicates with the DRAM module 22 and 
executes commands received from the programmable SRAM 24, the array control sequencer 
26, the DCU 20, and the HMI 12. Each PE in the processing element array 28 includes 
dedicated H-registers for communication with the HMI 12. Control of the H-registers is 
shared by the HMI 12 and the DCU 20. 

[0042] Referring now to FIG. 2, a block diagram of a PE 30 according to one embodiment of 
the present invention is illustrated. PE 30 includes an arithmetic logic unit ("ALU") 32, Q- 
registers 34, M-registers 36, a shift control and condition register 38 (also called "condition 
logic" 38), a result register pipeline 40, and register file 42. The PE 30 may also contain other 
components such as multiplexers 48 and logic gates (not shown), among others. 
[0043] In the current embodiment, the Q-registers 34 are operable to merge data into a 
floating point format and the M-Registers 36 are operable to de-merge data from a floating 
point format into a single magnitude plus an exponent format, among others. The Q- and M- 
registers may receive data from Q and M shift buses, respectively, and from the result register 
pipeline 40, among others. 

[0044] The ALU 32 includes a multiplier-adder operable (among others) to receive 
information from the Q-registers 34 and M-registers 36, execute tasks assigned by the TDU 
18 (see FIG. 1), and transmit results to the condition logic 38 and to the result register 
pipeline 40. The result register pipeline 40 is operable to communicate with the register file 
42, which holds data for transfer into or out of the DRAM modules 22 via a DRAM interface 
44. Data is transferred between the PE and the DRAM module 22 via a pair of registers, one 
register being responsive to the DCU 20 and the other register being responsive to the PE 30. 
The DRAM interface 44 receives command information from the DCU 20. The DRAM 
interface 44 also permits the PE 30 to communicate with the host through the host memory 
access port 46. 

[0045] In the current embodiment, the H-registers 42 are comprised of synchronous SRAM 
and each processing element within the processing element array 28 contains eight H-registers 
42 so that two pages can be stored from different DRAM locations, thus allowing the 
interleaving of short i/o bursts to be more efficient. Result register pipeline 40 also includes 
one or more neighborhood connection registers ("X-register") (see FIG. 3). The X-register 
links one PE 30 via a transfer network to its neighboring PEs 30 in the processing element 
array 28. 

[0046] FIG. 3 is a more detailed illustration of some components of the processing element 
of FIG. 2 according to an embodiment of the present invention. For example in FIG. 3, M-
registers 36 include four (4) registers M0 - M3 each having an associated multiplexer 
MMP0 - MMP3, respectively, which receive signals from the result pipe 40 (among others) 
via multiplexer 54. The outputs of registers M0 - M3 are connected to ALU 32 via 
multiplexer 52. Furthermore, Q-registers 34 include four (4) registers Q0 - Q3 each 
having an associated multiplexer QMP0 - QMP3, respectively, which receive signals from 
each other and from the output of the M-registers 36 (among others). The outputs of registers 
Q0 - Q3 are connected to ALU 32 via multiplexer 50. 

[0047] Additionally, result pipe 40 includes four (4) registers R0, R1, R2, and X, as well as 
several multiplexers (i.e., RMP1, RMP2, XMP). The output of registers R0, R1, and R2 may 
be sent, for example, to M-registers 36 via multiplexer 54 and to the ALU 32 via multiplexer 
50. Furthermore, the output of the X register may be sent back to registers R1 and R2 in the 
result pipe 40 and sent to neighboring PEs via a transfer network accessed through node X-
OUT. 

[0048] The transfer network refers to the interconnections which allow PEs to communicate 
with each other via their associated X registers. Referring briefly to FIG. 7 for example, the 
loop connected 16x16 2-D array 28 for the current embodiment is illustrated. A loop 
connected array refers to an array whose edge PEs (e.g., those in the first and last rows and 
the first and last columns) have a similar level of connectivity as non-edge PEs. FIG. 7 
illustrates the connectivity of the rows and columns, respectively, of the array 28. More 
specifically in the loop connected 2-D array 28, the connections between edge PEs "wrap" 
around the column and rows, thus, both edge and non-edge PEs can transfer data to four 
neighboring PEs. For example in FIG. 7, PE_c1 is a non-edge PE which can communicate with 
its neighbors to the north (i.e., PE_b1), south (i.e., PE_d1), east (i.e., PE_c2), and west (i.e., PE_c0), 
and even though PE_c0 is an edge PE, PE_c0 can communicate with its neighbors to the north 
(i.e., PE_b0), south (i.e., PE_d0), east (i.e., PE_c1), and west (i.e., PE_c15) due to the loop 
connection. It should be noted that loop connection for an n-dimensional array provides 2n 
neighbors for each PE (i.e., two neighbors in each dimension). 
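
The wrap-around connectivity of a loop-connected array can be expressed with modular arithmetic. The following Python sketch is illustrative only. The function names are hypothetical, and the labeling (rows lettered a through p from north to south, columns numbered 0 through 15 from west to east) is an assumption made for the example.

```python
# Sketch of neighbor lookup in a loop-connected 2-D array (hypothetical names).
def loop_neighbors(row, col, rows=16, cols=16):
    """Return the four neighbors of the PE at (row, col), wrapping at the edges."""
    return {
        "north": ((row - 1) % rows, col),
        "south": ((row + 1) % rows, col),
        "east": (row, (col + 1) % cols),
        "west": (row, (col - 1) % cols),
    }


def label(position):
    """Format a (row, col) pair as a PE label, e.g. (2, 0) -> 'PE_c0'."""
    row, col = position
    return f"PE_{chr(ord('a') + row)}{col}"


edge = {side: label(pos) for side, pos in loop_neighbors(2, 0).items()}
print(edge)
```

For the edge PE in row c, column 0, the western neighbor wraps to column 15, so every PE has 2n = 4 neighbors in the two-dimensional case.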

[0049] It should be noted that the number of PEs 30 included in array 28 may be altered 
while remaining within the scope of the present invention. Additionally, the number of 
dimensions for array 28 may be varied while remaining within the scope of the present 
invention. It should be further noted that each PE 30 is interconnected with its neighboring 
PEs via an associated X-register link. Accordingly, information can be shared among the 
PEs. It should be noted that the information may flow in any direction (i.e., north-to-south, 
south-to-north, east-to-west, and west-to-east) while remaining within the scope of the present 
invention. 



[0050] Returning to FIG. 3, the X register is loaded through the X multiplexer (XMP) which 
selects one of the output of registers R0, Rl, and R2 and the output of multiplexer 48, among 
others. It should be noted that multiplexer 48 receives signals XS, XE, XN, and XW from the 
transfer network. For example, XS represents the X_Out output from the instant PE's 
southern neighbor, XE represents the X_Out output from the instant PE's eastern neighbor, 
etc. 



[0051] ALU 32 includes a 16-bit multiplier adder ("MA") and a logic unit, among others. In 
the current embodiment, the MA is designed to allow two's-complement addition or 
subtraction and signed magnitude addition or subtraction. The logic unit is designed to allow 
logical functions between two arguments such as bit-wise OR and AND functions, among 
others. Condition logic 38 includes Z, N, and C flag registers, as well as an SCR register. As 
illustrated, the MA and the logic unit communicate with the C flag register via multiplexer 56 
and with the SCR register and the result pipe 40 via multiplexer 58. 
[0052] It should be noted that the detailed illustration of PE 30 in FIG. 3 has a number of 
components, signal lines, and connections omitted for clarity. It should be apparent to those 
skilled in the art that additional components, signal lines, and connections may be added while 
remaining within the scope of the present invention. 

[0053] FIG. 4 illustrates an operational process 60 for determining a global extrema for an 
array of processing elements according to an embodiment of the present invention. 
Operational process 60 begins when the local extrema for each PE is placed onto the transfer 
network in operation 61. 

[0054] For example in the current embodiment, each PE in array 28 (see FIG. 7) receives a 
set of values from the DRAM interface 44 and the host memory access port 46 (see FIG. 1), 
among others. After the values are assigned to each PE in the array 28, each PE determines 
its local extrema. In the current embodiment, local extrema refers to the maximum or 
minimum value for a set of values on an individual PE. A method for determining the 
local extrema from a set of values on an individual PE is discussed in more detail in U.S. 
Patent Application Serial No. ________ entitled "Method for Finding Local Extrema of a Set of 
Values for a Parallel Processing Element" filed ________ (DB001076-000, Micron no. 03-
0052) and incorporated in its entirety by reference herein. 

[0055] In the current embodiment, each local extrema is a 16-bit value and is referred to as a 
"short". As discussed above, the PEs used in the current embodiment are 8-bit processing 
elements. Thus, each short is processed as two separate bytes, a "most significant" (MS) byte 
and a "least significant" (LS) byte. The convention used in the current alternative 
embodiment is known as "big-endian", that is the MS byte is stored in the LS-register file 
address. It should be noted that other methods of finding the local extrema for each PE may 
be utilized while remaining within the scope of the present invention. 
[0056] Once the local extrema has been determined, each PE places its local extrema onto 
the transfer network in operation 61. For example, in the current embodiment, each PE uses 
its associated X register to place its local extrema onto the transfer network. 
[0057] After each PE places its local extrema on the transfer network, an extrema is 
determined for each line in a first dimension of the array in operation 62. In the current 
embodiment, for example, each PE compares its local extrema to the local extrema of the 
other PEs within its row to determine a row extrema (i.e., for its associated row). Each PE 
transmits its local extrema via the transfer network to each other PE within the same row. 
Thus, each PE within the same row will calculate the same row extrema as the other PEs within 
that row. 

[0058] For example, referring to FIG. 7, assume that the local extremas for the PEs in row-c 
(i.e., the third row) are determined as follows: PE_c0 = 2, PE_c1 = 5, PE_c2 = 1, PE_c3 = 6, PE_c4 = 3, 
..., PE_c15 = 5. Accordingly, the local extremas for row-c may be represented by the set 
of values {2, 5, 1, 6, 3, 2, 4, 5, 3, 5, 3, 4, 0, 1, 4, 5}. Each PE within row-c (i.e., PE_c0, PE_c1, 
..., PE_c15) receives via the transfer network, and determines the row extrema from, this set of values. 
[0059] It should be noted that each PE will receive the set of values in a different order. 
Referring to FIG. 7, for example, consider the embodiment where the data is moved from 
right to left. PE_c2 will see its own value (i.e., 1), followed by values moving in from the right 
hand side so the order of the set of values for PE_c2 will be {1, 6, 3, 2, 4, 5, 3, 5, 3, 4, 0, 1, 4, 5, 
2, 5}. PE_c3 will receive the same set of values, however, PE_c3 will see its own value (i.e., 6) 
followed by values moving in from the right hand side. Thus, the order of the set of values 
for PE_c3 will be {6, 3, 2, 4, 5, 3, 5, 3, 4, 0, 1, 4, 5, 2, 5, 1}. The specific order of the set of 
values for the remaining PEs may be found in a similar manner. 
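
The order in which a given PE sees the row values when data moves from right to left is a rotation of the row, starting at that PE's own column. The Python sketch below is illustrative only: the function name is hypothetical, and the row-c values are taken from the example above with columns numbered from 0.

```python
# Sketch of the arrival order seen by one PE in a loop-connected row
# (hypothetical function name; values are the row-c example, columns from 0).
ROW_C = [2, 5, 1, 6, 3, 2, 4, 5, 3, 5, 3, 4, 0, 1, 4, 5]


def arrival_order(row_values, col):
    """Own value first, then values shifting in from the right, wrapping around."""
    return row_values[col:] + row_values[:col]


print(arrival_order(ROW_C, 2))
print(arrival_order(ROW_C, 3))
```

Every PE therefore receives the same sixteen values, only rotated, so each arrives at the same row extrema.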

[0060] In operation 62, each PE in a row receives a set of values from the transfer network 
and simultaneously determines the row extrema for its associated row. For simplicity, the 
current discussion will be limited to finding the high row extrema for the array 28, however, it 
should be noted that a low row extrema may be determined while remaining within the scope 
of the present invention. For example, each PE in row-c receives the set of values {2, 5, 1, 
6, 3, 2, 4, 5, 3, 5, 3, 4, 0, 1, 4, 5} and determines that the high row extrema for row-c is equal 
to 6 in operation 62. It should be noted that the PEs in the other rows of array 28 
simultaneously determine the row extrema for their associated row. 

[0061] After a line extrema is found for each line for a first dimension in the array, a line 
extrema is found for each line for a next dimension in the array in operation 63. For example 
in the current embodiment, each PE determines the column extrema for its associated column 
by comparing its row extrema (as calculated in operation 62) to the row extrema of the other 
PEs within its column. Each PE transmits its row extrema via the transfer network to each 
other PE within the same column. Thus, each PE within the same column will calculate the same 
column extrema as the other PEs within that column. 

[0062] Again referring to FIG. 7, assume that the row extremas for the array 28 are 
determined in operation 62 as follows: row-a = 7, row-b = 3, row-c = 6, row-d = 4, row-e = 5, 
row-f = 4, row-g = 2, row-h = 3, row-i = 6, row-j = 4, row-k = 2, row-l = 3, row-m = 5, row-n 
= 1, row-o = 2, row-p = 3. Accordingly, the row extremas for the array 28 may be 
represented by the set of values {7, 3, 6, 4, 5, 4, 2, 3, 6, 4, 2, 3, 5, 1, 2, 3}. In operation 63, 
each PE in the column receives the set of row extrema values via the transfer network and 
determines the column extrema from this set. In the instant example, each PE determines that 
the high column extrema is equal to 7. It should be noted that the low column extrema (here 
equal to 1) may also be determined while remaining within the scope of the present invention. 
[0063] Operational process 60 then continues with determination process 64. If the array has 
another dimension, control branches YES and operation 63 is repeated for the next dimension. 
If the array does not have another dimension, control branches NO and operation 65 
terminates operational process 60. 

[0064] It should be noted that, for the 2-dimensional array in the instant example, the value 
of the column extrema also represents the value of the array extrema (i.e., each PE will have 
as its column extrema the local extrema value from the PE having the largest local extrema (i.e., 
high array extrema) or smallest local extrema (i.e., low array extrema)). 
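
By way of illustration only, the second sweep of the instant example may be sketched as follows. The list of row extrema is copied from the example above; the function name `column_extrema` and the `high` parameter are assumptions made for this sketch and do not appear in the described embodiment.

```python
# Row extrema for row-a through row-p, copied from the example above.
ROW_EXTREMA = [7, 3, 6, 4, 5, 4, 2, 3, 6, 4, 2, 3, 5, 1, 2, 3]


def column_extrema(row_extrema, high=True):
    """Value each PE in a column computes from the row extrema it receives.

    Hypothetical helper: every PE in the column sees the same set of values,
    so every PE arrives at the same result.
    """
    pick = max if high else min
    return pick(row_extrema)


high_array_extrema = column_extrema(ROW_EXTREMA, high=True)
low_array_extrema = column_extrema(ROW_EXTREMA, high=False)
```

For the 2-dimensional case, the two results correspond to the high and low array extrema stated in the example.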

[0065] As mentioned above, the local extrema for each PE is a short (i.e., 16 bits) which may 
be separated into an MS byte and an LS byte. It should be noted that each dimensional extrema 
is also a short. The shorts, as seen by each PE in the current embodiment, are in binary 
format. For simplicity, however, the shorts are represented in decimal format in the instant 
discussion. Accordingly, the set may be represented as {short-1, short-2, short-3, ... short-16}, 
where the numerals 1-16 represent each short's location within the set. 
[0066] The position of each value within the value set may be designated as {1, 2, 3, 4, ... 
N} and thus, depending on its position within the set, a value may be designated as either 
"odd numbered" or "even numbered." For example, an "odd numbered value" refers to those 
values located at an odd numbered position within the value set. The odd numbered values 
for PE c2 in the current embodiment, for example, are values {1, 3, 4, 3, 3, 0, 4, 2} which are 
located at odd numbered positions 1, 3, 5, 7, 9, 11, 13, and 15, respectively, in PE c2 's data set. 
Because the local extrema are obtained in a different order for each PE, the odd numbered 
values for PE c3 in the current embodiment, however, are values {6, 2, 5, 5, 4, 1, 5, 5} which 
are located at the odd numbered positions 1, 3, 5, 7, 9, 11, 13, and 15, respectively, in PE c3 's 
data set. Similarly, an "even numbered value" refers to those values located at an even 
numbered position within the value set. The even numbered values for PE c2 in the current 
embodiment, for example, are values {6, 2, 5, 5, 4, 1, 5, 5} which are located at even 
numbered positions 2, 4, 6, 8, 10, 12, 14, and 16, respectively, in PE c2 's data set. The even 
numbered values for PE c3 in the current embodiment, however, are values {3, 4, 3, 3, 0, 4, 2, 
1} which are located at the even numbered positions 2, 4, 6, 8, 10, 12, 14, and 16, 
respectively, in PE c3 's data set. 
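
The odd and even numbered values listed above may be reproduced with the following minimal sketch. It assumes that each PE sees the row-c values as a rotation of the same set beginning with its own value; the helper names `received_order` and `split_odd_even` are hypothetical.

```python
# Local extrema of row-c in the order seen by PE c0.
ROW_C = [2, 5, 1, 6, 3, 2, 4, 5, 3, 5, 3, 4, 0, 1, 4, 5]


def received_order(values, pe_index):
    """Order in which the PE at pe_index sees the values (assumed rotation)."""
    return values[pe_index:] + values[:pe_index]


def split_odd_even(seq):
    """Positions are 1-based, so odd positions are list indices 0, 2, 4, ..."""
    return seq[0::2], seq[1::2]


odd_c2, even_c2 = split_odd_even(received_order(ROW_C, 2))
odd_c3, even_c3 = split_odd_even(received_order(ROW_C, 3))
```

Under this assumption the odd numbered values of one PE become the even numbered values of its neighbor, which is why the two sets swap between PE c2 and PE c3.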

[0067] It should be noted that in the current embodiment the local and dimensional extrema 
for each PE are placed on the transfer network one byte at a time. For example in operation 
61, the LS-byte of each local extrema is placed on the transfer network and transferred to one 
or more PEs within the array. The MS-byte of each local extrema is later placed on the 
transfer network and transferred to one or more PEs within the array. 

[0068] It should further be noted that the order in which the LS-bytes and the MS-bytes are 
placed onto the transfer network and the number of PEs to which the bytes are transferred 
may be altered while remaining within the scope of the present invention. For example, in the 
current embodiment, the LS and MS bytes are transferred in bursts. The burst length may be 
selected to approximately equalize the number of lost transfer cycles and the number of lost 
ALU cycles. This effectively reduces the number of lost transfer cycles as compared to the 
first approach discussed above. Additionally by using bursts, the ALU can start comparing 
shorts more quickly as compared to the second approach discussed above, thus reducing the 
number of lost processing cycles. Once the ALU is started, the use of bursts helps to 
minimize the time that the ALU is required to wait for data. 
[0069] In the current embodiment, the LS bytes and MS bytes are alternately bursted. The 
size of each burst is selected to fill the local memory. For example, LS bytes are bursted until 
all local registers (R0,R1,R2) are full, then MS bytes are bursted until local registers are full, 
then LS bytes are bursted until local registers are full, etc. until all shorts have been 
processed. 

[0070] It should be noted that the resources of an array of parallel processing elements may 
be "maximized" in various ways. For example, where the input data arrives via the transfer 
network, maximization may occur when the number of lost cycles of each processor is 
approximately equal to the number of lost cycles for the transfer network, among others. 
Alternatively where the input data is read from the register file 42, maximization may occur 
when the number of lost cycles for each processor is approximately equal to the number of 
cycles lost while reading from the register file 42. For example, the resources of a PE may be 
maximized (as disclosed in "Method for Finding Local Extrema of a Set of Values for a 
Parallel Processing Element" (DB001076-000, Micron no. 03-0052)) such that zero cycles are 
lost when reading from the register file 42. 

[0071] In the current embodiment, "maximization" is achieved using a burst 6 bytes in 
length. Once started, for continuous operation, a further seven (7) cycles are required to 
transfer a burst six (6) bytes in length (i.e., the resources of the PE are maximized such that 
approximately only one (1) in seven (7) cycles is lost). For sixteen (16) shorts, the current 
embodiment is completed after approximately (7 x 16 x 2 / 6) = 37.33 cycles. It should be 
noted that for every three (3) further shorts there is one cycle lost on the transfer network and 
one cycle lost in the ALU. It should further be noted that in any practical implementation 
'end effects' occur. That is, when the algorithm is initialized, and when it is terminated, 
additional cycles may be required. For example to process the sixteen (16) shorts in the 
current embodiment, the transfer network and the ALU both require thirty-five (35) cycles. 
However, the transfer network operates for nine (9) cycles before the ALU begins to operate. 
Also a single 'housekeeping' cycle is present at the termination of the algorithm. 
Accordingly, a total of (9 + 35 + 1) = 45 cycles are completed in the current embodiment. 
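
The cycle counts stated above may be checked with the following short calculation. The constant names are assumptions introduced for this sketch; the numeric values are those given in the text.

```python
from fractions import Fraction

SHORTS = 16            # shorts processed per PE
BYTES_PER_SHORT = 2    # one LS byte and one MS byte
BURST_BYTES = 6        # burst length in bytes
CYCLES_PER_BURST = 7   # cycles needed per burst in continuous operation

# Steady-state estimate: 7 cycles for every 6 bytes transferred.
steady_state_cycles = Fraction(
    CYCLES_PER_BURST * SHORTS * BYTES_PER_SHORT, BURST_BYTES
)

# End effects: 9 start-up cycles, 35 working cycles, 1 housekeeping cycle.
total_cycles = 9 + 35 + 1
```

The difference between the two figures reflects the start-up and termination cycles that the steady-state estimate does not account for.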
[0072] It should be apparent to one skilled in the art that in an alternative embodiment the 
order of operations 62 and 63 may be reversed while remaining within the scope of the 
present invention. For example, the column extrema may be found first. The set of column 
extrema are then used to determine the row extrema. In the instance in which order of 
operations 62 and 63 are reversed, the value determined for the row extrema also represents 
the value of the array extrema. 

[0073] It should further be noted that the present invention may be employed for arrays of 
other sizes and shapes. For example, the present invention may be used to balance a K x L x 
M x ...etc., n-dimensional array of processing elements (PEs), wherein K represents the 
number of PEs on a line traversing a first dimension of the array, L represents the number of 
PEs on a line traversing a second dimension of the array, M represents the number of PEs on 
a line traversing a third dimension of the array, etc. More generally, the present invention 
may be used to balance an array having (N) PEs traversing each line in a dimension, 
where N may be different for each dimension. 

[0074] One example may be a 3 x 5 x 7 array in which the array is comprised of three (3) 
lines in a first dimension, five (5) lines in a second dimension, and seven (7) lines in a third 
dimension. Applying operational process 60 to the 3 x 5 x 7 array, each PE calculates the 
extrema for its row (i.e., the first dimension) using the local extrema of the other PEs in the 
associated row. Next, each PE calculates the extrema for its column (i.e., the second 
dimension) using the row extrema of the other PEs in the associated column. Then, each PE 
calculates the extrema for its line in the third dimension using the column extrema of the other 
PEs in its third dimensional line. 
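
The dimension-by-dimension sweep may be sketched, under stated assumptions, as follows. The sketch treats the shape (3, 5, 7) as the number of PEs along each dimension, fills the array with arbitrary hypothetical local extrema, and uses a dictionary keyed by PE coordinates; none of these choices is taken from the described embodiment.

```python
from itertools import product


def sweep_extrema(values, shape, pick=max):
    """Sweep one dimension at a time; each PE replaces its value with the
    extrema of the line it sits on along the current dimension."""
    current = dict(values)
    for axis in range(len(shape)):
        updated = {}
        for coord in product(*(range(n) for n in shape)):
            # PEs on the same line along `axis` share all other coordinates.
            line = [
                current[coord[:axis] + (i,) + coord[axis + 1:]]
                for i in range(shape[axis])
            ]
            updated[coord] = pick(line)
        current = updated
    return current


shape = (3, 5, 7)
# Hypothetical local extrema: an arbitrary deterministic fill.
local = {
    c: (7 * c[0] + 3 * c[1] + 5 * c[2]) % 11
    for c in product(*(range(n) for n in shape))
}
high = sweep_extrema(local, shape, max)
low = sweep_extrema(local, shape, min)
```

After the final sweep every PE holds the same value, which is the array extrema for the whole array.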

[0075] FIG. 5 illustrates an operational process 70 for determining a dimensional (e.g., row, 
column, line, etc.) extrema of a single dimension within an N-dimensional array of processing 
elements according to an embodiment of the present invention. For example, operational 
process 70 may be used by a processing element to determine the row extrema for an 
associated row as previously discussed in conjunction with operation 62 above. Likewise, 
operational process 70 may be used by a processing element to determine the column extrema 
for an associated column as previously discussed in conjunction with operation 63 above. 
[0076] For simplicity, the current embodiment of operational process 70 will be discussed in 
conjunction with finding the dimensional extrema of row-c of array 28. More specifically, the 
dimensional extrema of row-c as determined by processing element PE c0 will be discussed. 
[0077] Operational process 70 begins when each PE within the row receives the input values 
(i.e., the local extrema shorts) from the other PEs within the associated row in operation 71. 
As discussed above, each PE receives the same values but in a different order. For example 
in the current embodiment, PE c0 receives the set of values {2, 5, 1, 6, 3, 2, 4, 5, 3, 5, 3, 4, 0, 1, 
4, 5}, PE c1 receives the set {5, 1, 6, 3, 2, 4, 5, 3, 5, 3, 4, 0, 1, 4, 5, 2}, PE c2 receives the set {1, 
6, 3, 2, 4, 5, 3, 5, 3, 4, 0, 1, 4, 5, 2, 5}, etc. 

[0078] In operation 72, the odd numbered local extrema shorts are placed into an odd 
pipeline and the even numbered local extrema shorts are placed into an even pipeline. Each 
pipeline is made up of one or more registers (among others). PE c0 having the input value set 
{2, 5, 1, 6, 3, 2, 4, 5, 3, 5, 3, 4, 0, 1, 4, 5}, for example, places short-1 (i.e., 2), short-3 (i.e., 1), 
short-5 (i.e., 3), short-7 (i.e., 4), short-9 (i.e., 3), short-11 (i.e., 3), short-13 (i.e., 0), and short-15 
(i.e., 4) into its odd pipeline and short-2 (i.e., 5), short-4 (i.e., 6), short-6 (i.e., 2), short-8 
(i.e., 5), short-10 (i.e., 5), short-12 (i.e., 4), short-14 (i.e., 1), and short-16 (i.e., 5) into its even 
pipeline. 

[0079] Once the shorts are separated into the odd and even pipelines in operation 72, an odd 
extrema is determined for the shorts within the odd pipeline and an even extrema is 
determined for the shorts within the even pipeline in operation 73. In the current 
embodiment, the ALU is used to compare the odd-numbered shorts to the other odd-numbered 
shorts and the odd extrema is determined. Likewise, the ALU is used to compare the 
even-numbered shorts to the other even-numbered shorts and the even extrema is determined. For 
example for row-c, the high odd extrema is determined to be four (4) (i.e., short-7 and 
short-15), whereas the high even extrema is determined to be six (6) (i.e., short-4). 
[0080] The odd and even extrema determined in operation 73 are then compared to each 
other and a dimensional extrema is determined in operation 74. For example in the current 
embodiment, the high odd and high even extrema determined in operation 73 are compared to 
each other to determine a high row extrema of six (6) for row-c. 
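
Operations 72 through 74 may be summarized by the following minimal sketch, which operates on whole shorts rather than on individual bytes. The function name `dimensional_extrema` and its `high` parameter are assumptions made for illustration.

```python
def dimensional_extrema(shorts, high=True):
    """Split into odd/even pipelines, reduce each, then combine."""
    pick = max if high else min
    odd_pipeline = shorts[0::2]    # short-1, short-3, ... (operation 72)
    even_pipeline = shorts[1::2]   # short-2, short-4, ...
    odd_extrema = pick(odd_pipeline)     # operation 73
    even_extrema = pick(even_pipeline)   # operation 73
    return odd_extrema, even_extrema, pick(odd_extrema, even_extrema)  # 74


# Input value set of PE c0 for row-c.
ROW_C = [2, 5, 1, 6, 3, 2, 4, 5, 3, 5, 3, 4, 0, 1, 4, 5]
odd_high, even_high, row_high = dimensional_extrema(ROW_C, high=True)
odd_low, even_low, row_low = dimensional_extrema(ROW_C, high=False)
```

Splitting the set into two pipelines does not change the result; it only allows two independent comparisons to be interleaved.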

[0081] It should be noted that each short within the odd and even pipelines may further be 
divided into an LS-byte and an MS-byte. For example, the odd extrema may be stored within 
two registers which may further be initialized with the LS-byte and the MS-byte, respectively, 
of the first short placed in the odd pipeline (e.g., short-1). Then, the LS-byte of short-1 may 
be compared to the LS-byte of the next short (e.g., short-3) within the odd pipeline. 
Depending on the result, the value of a carry flag may be set. Using the carry flag value, the 
MS-byte of short-1 may then be compared to the MS-byte of short-3. Depending on the result, 
the value of an odd flag may be set. The registers containing the odd extrema may be updated 
with the new short (e.g., short-3) or may continue to hold their current values (e.g., short-1) 
depending on the value of the odd flag. The even pipeline may function in a similar manner 
with the even shorts. 

[0082] After reading the following discussion, it should become apparent to those skilled in 
the art that operational process 70 may be implemented simultaneously by each processor in 
the array 28, and that operational process 70 may be applied to other sizes of arrays and other 
types of arrays (e.g., non-square N-dimensional arrays) while remaining within the scope of 
the present invention. 

[0083] FIGS. 6a - 6h graphically represent operational process 70 as applied to a single line 
of array 28 as illustrated in FIG. 7 according to an embodiment of the present invention. 
More specifically, FIGS. 6a - 6h graphically represent operational process 70 as applied to 
array 28 to determine the row extrema of row-c as implemented by PE c0 . 
[0084] As illustrated in FIGS. 6a-6h, the movement of the information throughout the PE is 
divided into a series of clock pulses, or cycles. It should be noted that all operations within a 
cycle happen simultaneously. Thus if a register is being "read from" and "written to" in the 
same cycle, the "old" data moves out of the register at the same time that the "new" data 
moves into the register. Accordingly, the old data is not lost. In the current embodiment, the 
old contents of a particular register will be that value written to the register during the cycle 
immediately preceding the current cycle. If a value was not written to the particular register 
during the cycle immediately preceding the current cycle, the old contents of the register will 
be the last value written to the register during the closest preceding cycle to the current cycle. 
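
The simultaneous read and write behavior may be modeled, as a sketch only, by reading every source register from the old state before writing any destination. The function `clock`, the dictionary representation of the registers, and the `network_in` parameter are assumptions of this sketch; the register names and the moves are those described for clock pulse number four (4).

```python
def clock(regs, moves, network_in=None):
    """Apply all register moves of one cycle at once.

    moves maps destination register -> source register. Sources are read
    from the old state, so a register may be read and written in one cycle.
    """
    old = dict(regs)
    new = dict(regs)  # registers not written keep their last written value
    for dest, src in moves.items():
        new[dest] = old[src]
    if network_in is not None:
        new["X"] = network_in  # next byte arriving from the transfer network
    return new


# State after clock pulse three (3), as described for PE c0.
before = {"X": "LS_c2", "R1": "LS_c1", "M2": "LS_c0", "M0": None, "Q2": None}
# Clock pulse four (4): X -> R1, R1 -> M0, M2 -> Q2, X takes LS_c3.
after = clock(before, {"R1": "X", "M0": "R1", "Q2": "M2"}, network_in="LS_c3")
```

Note that register R1 is both read (into M0) and written (from X) in the same cycle without losing its old contents.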
[0085] With clock pulse number one (1), the PE loads the LS-byte of its local extrema onto 
the transfer network and the MS-byte of its local extrema into its first result register. As 
discussed above in conjunction with FIG. 3, the transfer network refers to the 
interconnections which allow a PE to communicate with its neighboring PEs via their 
associated X registers. The local extrema is determined, for example, as discussed above in 
conjunction with operation 62 of FIG. 4. In the instant embodiment, the PEs of row-c in array 28 
have local extrema represented by the set of values {2, 5, 1, 6, 3, 2, 4, 5, 3, 5, 3, 4, 0, 1, 4, 5}. 
Accordingly during the first clock pulse, PE c0 loads its LS-byte (i.e., LS c0 ) of its local extrema 
into its X register and its MS-byte (i.e., MS c0 ) of its local extrema into its register R0. 
[0086] After the PE loads the LS-byte of its local extrema onto the transfer network and the 
MS-byte of its local extrema into its first result register, the following actions occur 
simultaneously during clock pulse number two (2): the value within the X register is loaded 
into a first register and the value of the X register is placed on the transfer network and is 
shifted around the loop (e.g. row) of the transfer network one PE at a time. This has the effect 
that the X register receives the next local extrema byte from a PE adjacent on the loop of the 
transfer network. 

[0087] For example, in the current embodiment, LS c0 is loaded from the X register into 
register R1 via RMP1, LS c0 is shifted westward (i.e., towards PE c7 ) via X_Out on the transfer 
network, and LS c1 is loaded into PE c0 's X register via multiplexer 48 and XMP. It should be 
noted that LS c1 is the local extrema from PE c0 's closest eastern neighbor (i.e., PE c1 ). 
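
The westward shift around the row loop may be sketched as follows. For simplicity the sketch shifts whole values rather than separate LS and MS bytes, and it assumes a closed loop containing the sixteen row-c values; the helper names `shift_west` and `values_seen` are hypothetical.

```python
# Local extrema of row-c, indexed by PE position c0, c1, ...
ROW_C = [2, 5, 1, 6, 3, 2, 4, 5, 3, 5, 3, 4, 0, 1, 4, 5]


def shift_west(x_regs):
    """One transfer cycle: each X register takes its eastern neighbor's value."""
    return x_regs[1:] + x_regs[:1]


def values_seen(x_regs, pe_index, cycles):
    """Values passing through one PE's X register, starting with its own."""
    seen = [x_regs[pe_index]]
    for _ in range(cycles):
        x_regs = shift_west(x_regs)
        seen.append(x_regs[pe_index])
    return seen


seen_c0 = values_seen(ROW_C, 0, 15)
seen_c1 = values_seen(ROW_C, 1, 15)
```

Each PE therefore sees the same values as its neighbors, offset by one position in the sequence.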
[0088] The following actions occur simultaneously during clock pulse number three (3): the 
value in the first register is transferred to a second register, the value within register X is 
loaded into the first register, and the X register retrieves the next local extrema byte from the 
transfer network. 

[0089] In the current embodiment, LS c0 is transferred from register R1 to register M2 via 
multiplexer 54 and multiplexer MMP2, LS c1 is transferred from the X register to register R1 
via the multiplexer RMP1, and LS c2 (i.e., the LS-byte of the local extrema for PE c2 ) is loaded 
into the X register via multiplexer 48 and XMP. 
[0090] The following actions simultaneously occur during clock pulse number four (4): the 
value in register X is loaded into the first register, the value in the first register is loaded into a 
third register, the value within the second register is loaded into a fourth register, and the X 
register retrieves the next local extrema byte from the transfer network. 
[0091] Accordingly in the current embodiment, LS c2 is loaded into register R1 via the 
multiplexer RMP1, LS c1 is transferred from register R1 into register M0 via the multiplexer 54 
and multiplexer MMP0, LS c0 is loaded from register M2 into register Q2, and LS c3 (i.e., the 
LS-byte for the local extrema of PE c3 ) is loaded into the X register via multiplexer 48 and 
XMP. It should be noted that in the current embodiment register Q2 contains the odd extrema 
LS-byte. It should further be noted that register Q2 is initialized with the first odd LS-byte 
that is processed, here LS c0 . 

[0092] The following actions simultaneously occur during clock pulse number five (5): the 
value in register X is loaded into the first register, the value in the first register is transferred 
into the second register, the value within the third register is loaded into a fifth register, and 
the X register retrieves the next local extrema byte from the transfer network. 
[0093] Accordingly in the current embodiment, LS c3 is loaded into register R1 via the 
multiplexer RMP1, LS c2 is transferred from register R1 into register M2 via the multiplexer 54 
and multiplexer MMP2, LS c1 is transferred from register M0 to register Q0, and LS c4 (i.e., the 
LS-byte for the local extrema of PE c4 ) is loaded into the X register via multiplexer 48 and 
XMP. It should be noted that in the current embodiment register Q0 contains the even 
extrema LS-byte. It should further be noted that register Q0 is initialized with the first even 
LS-byte that is processed, here LS c1 . 

[0094] During clock pulse number six (6), the following actions simultaneously occur: the 
value in register X is loaded into a sixth register, the value in the first register is transferred 
into the third register, and the X register retrieves the next local extrema byte from the 
transfer network. 

[0095] Accordingly in the current embodiment, LS c4 is loaded into register R2 via the 
multiplexer RMP2, LS c3 is transferred from register R1 into register M0 via the multiplexer 54 
and multiplexer MMP0, and LS c5 (i.e., the LS-byte for the local extrema of PE c5 ) is loaded 
into the X register via multiplexer 48 and XMP. As is apparent from FIGS. 6a and 6b, the first 
burst of LS bytes is transferred during clock pulse number two (2) through clock pulse 
number six (6). 

[0096] During clock pulse number seven (7), the following actions simultaneously occur: the 
value in register X is loaded into the first register and the value in the first result register is 
loaded into the X register. 
[0097] Accordingly in the current embodiment, LS c5 is loaded into register R1 via the 
multiplexer RMP1 and MS c0 (i.e., the MS-byte for the local extrema of PE c0 ) is transferred 
from register R0 into the X register via multiplexer XMP. 

[0098] During clock pulse number eight (8), the following actions simultaneously occur: the 
value in register X is loaded into the first register, the value in the first register is transferred 
into the ALU which updates the first result register with the transferred value, and the X 
register retrieves the next local extrema from the transfer network. 
[0099] Accordingly in the current embodiment, MS c0 is loaded into register R1 via the 
multiplexer RMP1, LS c5 is transferred into the ALU which updates register R0 with LS c5 , and 
MS c1 (i.e., the MS-byte for the local extrema of PE c1 ) is loaded into the X register via 
multiplexer XMP. 

[00100] The following actions simultaneously occur during clock pulse number nine (9): the 
value in register X is loaded into the first register, the value in the first register is transferred 
to a seventh register, and the X register retrieves the next local extrema from the transfer 
network. 

[00101] Accordingly in the current embodiment, MS c1 is loaded into register R1 via the 
multiplexer RMP1, MS c0 is transferred from R1 to register M3 via the multiplexer 54 and 
multiplexer MMP3, and MS c2 is loaded into the X register via multiplexer XMP. 
[00102] The following actions simultaneously occur during clock pulse number ten (10): the 
value in register X is loaded into the first register, the value in the first register is transferred 
to an eighth register, the values in the second and the fourth registers are transferred to the 
ALU and compared, the value in the seventh register is transferred to the ninth register and 
the X register retrieves the next local extrema from the transfer network. 
[00103] Accordingly in the current embodiment, MS c2 is loaded into register R1 via the 
multiplexer RMP1, MS c1 is transferred from register R1 to register M1 via multiplexer 54 and 
multiplexer MMP1, LS c0 and LS c2 are transferred to the ALU and compared, MS c0 is 
transferred from register M3 to register Q3 via multiplexer QMP3, and MS c3 is loaded into the 
X register via multiplexer XMP. It should be noted that in the current embodiment register 
Q3 contains the odd extrema MS-byte. It should further be noted that register Q3 is 
initialized with the first odd MS-byte that is processed, here MS c0 . 

[00104] In the current embodiment, the values loaded into the ALU (i.e., LS c0 and LS c2 ) are 
compared using the Multiplier/Adder (MA) and Logic Unit. For example, the MA subtracts 
the value contained in the second register (i.e., M2) from the value contained in the fourth 
register (i.e., Q2). If the result is negative (i.e., if the value within the second register is greater 
than the value within the fourth register), then the carry flag (i.e., flag C in the control logic 
38) is set to zero (0). If the result is positive or zero (i.e., the value within the fourth register 
is greater than or equal to the value within the second register), then the carry flag is set to one 
(1). 

[00105] For example in the instant case, LS c0 (which is contained in the fourth register, Q2) 
and LS c2 (which is contained within the second register, M2) are loaded into the MA. The 
value within M2 is subtracted from the value within Q2 (i.e., Q2 - M2) and the carry flag is 
set to zero (0) if the result is negative and set to one (1) if the result is positive or zero. It 
should be apparent to those skilled in the art that other types of comparisons may be used 
while remaining within the scope of the present invention, for example, subtracting Q2 from 
M2. 

[00106] During clock pulse number eleven (11), the following actions simultaneously occur: 
the value in register X is loaded into the first register, the value in the first register is 
transferred to the seventh register, the value in the eighth register is transferred to a tenth 
register, the values in the first and ninth registers are transferred to the ALU and compared, 
and the X register retrieves the next local extrema from the transfer network. 
[00107] Accordingly in the current embodiment, MS c3 is loaded into register R1 via the 
multiplexer RMP1, MS c2 is transferred from R1 to register M3 via multiplexer 54 and 
multiplexer MMP3, MS c1 is transferred from register M1 to register Q1 via multiplexer 
QMP1, MS c2 and MS c0 are transferred from register R1 and register Q3, respectively, to the 
ALU and compared, and MS c4 is loaded into the X register via multiplexer XMP. It should be 
noted that in the current embodiment register Q1 contains the even extrema MS-byte. It 
should further be noted that register Q1 is initialized with the first even MS-byte that is 
processed, here MS c1 . 

[00108] As discussed above in conjunction with clock pulse number 10, the values loaded into 
the ALU (i.e., MS c2 and MS c0 ) are compared using the Multiplier/Adder (MA) and Logic 
Unit. For example, the MA performs a 'subtract with carry' of the value contained in the first 
register (i.e., R1) from the value contained in the ninth register (i.e., Q3). If the result is 
negative (i.e., if the value within the first register is greater than the value within the ninth 
register), then the odd flag (i.e., flag C in the control logic 38) is set to zero (0). If the result is 
positive or zero (i.e., the value within the ninth register is greater than or equal to the value 
within the first register), then the odd flag is set to one (1). 

[00109] It should be noted that the 'subtract with carry' operation is a standard arithmetic 
algorithm as is known in the art. The subtraction of the MS byte includes the carry bit from 
the subtraction of the LS bytes as discussed above, for example, in conjunction with clock 
pulse 10. In the current embodiment, if the carry flag is zero (i.e., signaling a negative result) 
then an extra value of 1 is subtracted from the result of the MS byte calculation. Other 
arithmetic operations may be used while remaining within the scope of the present invention. 
[00110] It should be noted that a comparison of the first odd numbered shorts of the set of 
values is completed in clock pulses ten (10) and eleven (11). More specifically, the LS-bytes 
of short-1 and short-3 are compared in clock pulse ten (10), whereas the MS-bytes of short-1 
and short-3 are compared in clock pulse eleven (11). Additionally, the second, fourth, 
seventh, and ninth registers (i.e., registers M2, Q2, M3, and Q3, respectively) form a portion 
of the odd pipeline. 
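
The two-cycle comparison (LS-bytes first, then MS-bytes with the carry from the first subtraction) may be sketched as follows. The sketch assumes unsigned 16-bit shorts; the function name `compare_shorts` and its arguments are hypothetical.

```python
def compare_shorts(current, candidate):
    """Return 1 if current >= candidate, else 0, using two byte-wide steps.

    Assumes both arguments are unsigned 16-bit values.
    """
    cur_ls, cur_ms = current & 0xFF, (current >> 8) & 0xFF
    cand_ls, cand_ms = candidate & 0xFF, (candidate >> 8) & 0xFF

    # First cycle: subtract the LS-bytes; carry flag is 1 if result >= 0.
    carry = 1 if cur_ls - cand_ls >= 0 else 0

    # Second cycle: 'subtract with carry' on the MS-bytes. When the carry
    # flag is zero, an extra 1 is subtracted from the MS-byte result.
    borrow = 1 - carry
    flag = 1 if cur_ms - cand_ms - borrow >= 0 else 0
    return flag


samples = [0x0000, 0x0001, 0x00FF, 0x0100, 0x01FF, 0x0200, 0xFFFF]
results = {(a, b): compare_shorts(a, b) for a in samples for b in samples}
```

The flag produced by the second step plays the role of the odd flag (or even flag) that is later used to conditionally update the stored extrema.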

[00111] During clock pulse number twelve (12), the following actions simultaneously occur: 
the value in register X is loaded into the first register, the value in the first register is 
transferred to the eighth register, the values in the third and fifth registers are transferred to 
the ALU and compared, the MS-byte of the odd numbered shorts is conditionally updated, 
and the X register retrieves the next local extrema from the transfer network. 
[00112] Accordingly in the current embodiment, MS c4 is loaded into register R1 via the 
multiplexer RMP1, MS c3 is transferred from R1 to register M1 via multiplexer 54 and 
multiplexer MMP1, LS c3 and LS c1 are transferred from register M0 and register Q0, 
respectively, to the ALU and compared, register Q3 is conditionally updated with MS c2 from 
register M3 using the odd flag, and MS c5 is loaded into the X register via multiplexer XMP. 
[00113] In the current embodiment, the values loaded into the ALU (i.e., LS c3 and LS c1 ) are 
compared using the Multiplier/Adder (MA) and Logic Unit. For example, the MA subtracts 
the value contained in the third register (i.e., M0) from the value contained in the fifth register 
(i.e., Q0). If the result is negative (i.e., if the value within the third register is greater than the 
value within the fifth register), then the carry flag (i.e., flag C in the control logic 38) is set to 
zero (0). If the result is positive or zero (i.e., the value within the fifth register is greater than 
or equal to the value within the third register), then the carry flag is set to 
one (1). 

[00114] For example in the instant case, LS c1 (which is contained in the fifth register, Q0) and 
LS c3 (which is contained within the third register, M0) are loaded into the MA. The value 
within M0 is subtracted from the value within Q0 (i.e., Q0 - M0) and the carry flag is set to 
zero (0) if the result is negative and set to one (1) if the result is positive or zero. It should be 
apparent to those skilled in the art that other types of comparisons may be used while 
remaining within the scope of the present invention, for example, subtracting Q0 from M0. 
[00115] Additionally during clock pulse number twelve, register Q3 is conditionally updated 
with MS c2 . In the current embodiment, the value of the odd flag determined during clock 
pulse 11 is used to conditionally update the MS-byte of the odd short in the ninth register. 
For example if the largest value on the PE is to be found (i.e., the high extrema), then the 
value in the seventh register will be loaded into the ninth register when the odd flag is equal to 
zero (0), whereas the value within the ninth register will remain in the ninth register when the 
odd flag is equal to one (1). In this case, the ninth register may be referred to as the "max 
register for the odd short's MS-byte" because the largest short in the odd pipeline that has 
thus far been found by the process has its MS-byte stored in the ninth register. For example 
in the current embodiment where the high extrema of the set of shorts (i.e., {2, 5, 1, 6, 3, 2, 4, 5}) 
within row-c is being determined, the MS-byte of short-1 (which is greater than the MS-byte 
of short-3) remains within register Q3 because the odd flag is set equal to one (1) during clock 
pulse number eleven. 

[00116] Likewise, if the smallest value on the PE is to be found (i.e., the low extrema), then 
the value in the seventh register will be loaded into the ninth register when the odd flag is 
equal to one (1), whereas the value within the ninth register will remain in the ninth register 
when the odd flag is equal to zero (0). In this case, the ninth register may be referred to as the 
"min register for the odd short's MS-byte" because the smallest short in the odd pipeline that 
has thus far been found by the process has its MS-byte stored in the ninth register. For 
example in the current embodiment where the low extrema of the set of shorts (i.e., {2, 5, 1, 6, 3, 
2, 4, 5}) within row-c is being determined, the MS-byte of short-3 (which is less than the 
MS-byte of short-1) is loaded into register Q3 because the odd flag is set equal to one (1) during 
clock pulse number 11. 
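
The conditional update rule for the high and low cases may be sketched as follows. The sketch works on whole shorts for simplicity, and assumes that the flag equals one (1) when the stored value is greater than or equal to the new value, as in the subtraction described above. The names `update_extrema` and `run_pipeline` are hypothetical.

```python
def update_extrema(current, candidate, flag, high=True):
    """Apply the update rule: for the high extrema, load the candidate when
    the flag is 0; for the low extrema, load the candidate when it is 1."""
    if high:
        return candidate if flag == 0 else current
    return candidate if flag == 1 else current


def run_pipeline(shorts, high=True):
    """Feed one pipeline's shorts through the compare-and-update steps."""
    current = shorts[0]  # register initialized with the first short
    for candidate in shorts[1:]:
        flag = 1 if current >= candidate else 0  # result of (current - candidate)
        current = update_extrema(current, candidate, flag, high)
    return current


# Odd pipeline of PE c0 for row-c: short-1, short-3, ..., short-15.
ODD_PIPELINE = [2, 1, 3, 4, 3, 3, 0, 4]
odd_high = run_pipeline(ODD_PIPELINE, high=True)
odd_low = run_pipeline(ODD_PIPELINE, high=False)
```

In the low case an equal candidate is also loaded, which leaves the stored value unchanged and is therefore harmless.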

[00117] During clock pulse number thirteen (13), the following actions simultaneously occur: the 
value in register X is loaded into the first register, the value in the first register is transferred 
to the seventh register, the values in the eighth and tenth registers are transferred to the ALU 
and compared, the LS-byte of the odd numbered shorts is conditionally updated, and the X 
register retrieves the next local extrema from the transfer network. 
[00118] Accordingly in the current embodiment, MS c5 is loaded into register R1 via the 
multiplexer RMP1, MS c4 is transferred from R1 to register M3 via multiplexer 54 and 
multiplexer MMP3, MS c3 and MS c1 are transferred to the ALU and compared, register Q2 is 
conditionally updated with LS c2 from register M2, and MS c6 is loaded into the X register via 
multiplexer XMP. 

[00119] As discussed above in conjunction with clock pulse number twelve, the values loaded 
into the ALU (i.e., MS c3 and MS c1 ) are compared using the Multiplier/Adder (MA) and Logic 
Unit. For example, the MA subtracts the value contained in the eighth register (i.e., M1) from 
the value contained in the tenth register (i.e., Q1). If the result is negative (i.e., if the value 
within the eighth register is greater than the value within the tenth register), then the even flag 
(i.e., flag C in the control logic 38) is set to zero (0). If the result is positive or zero (i.e., the 
value within the tenth register is greater than or equal to the value within the eighth register), 
then the even flag is set to one (1). 
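
The flag rule stated in this paragraph may be expressed compactly as follows. This is a minimal sketch in Python; the function name `even_flag` is hypothetical, and the sketch covers only the single subtraction described here.

```python
def even_flag(q1, m1):
    """Return the even flag for one comparison.

    q1: value in the tenth register (Q1)
    m1: value in the eighth register (M1)
    The MA subtracts M1 from Q1; a negative result clears the flag,
    and a positive or zero result sets it.
    """
    result = q1 - m1
    return 0 if result < 0 else 1


print(even_flag(q1=3, m1=5))  # prints 0
print(even_flag(q1=5, m1=5))  # prints 1
```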






[00120] It should be noted that a comparison of the first two even numbered shorts of the set of 
values is completed in clock pulses twelve and thirteen. More specifically, the LS-bytes of 
short-2 and short-4 are compared during clock pulse twelve, whereas the MS-bytes of short-2 
and short-4 are compared during clock pulse thirteen. It should further be noted that the third, 
fifth, eighth, and tenth registers (i.e., registers M0, Q0, M1, and Q1, respectively) form a 
portion of the even pipeline. 
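
The two-pulse comparison noted above may be sketched in Python as follows. The sketch assumes, without asserting that the embodiment operates in exactly this way, that the carry flag set while comparing the LS-bytes acts as a borrow into the subtraction of the MS-bytes, and that the shorts are unsigned 16-bit values. The function names `split_short` and `compare_shorts` are hypothetical.

```python
def split_short(value):
    """Split an unsigned 16-bit short into (MS-byte, LS-byte)."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("value must fit in 16 bits")
    return (value >> 8) & 0xFF, value & 0xFF


def compare_shorts(stored, candidate):
    """Return 1 if stored >= candidate, else 0, using two byte-wide subtractions."""
    s_ms, s_ls = split_short(stored)
    c_ms, c_ls = split_short(candidate)
    # First pulse: compare the LS-bytes; the carry flag records a borrow.
    carry = 1 if s_ls - c_ls < 0 else 0
    # Second pulse: compare the MS-bytes, taking the borrow into account.
    result = s_ms - c_ms - carry
    return 0 if result < 0 else 1


print(compare_shorts(0x0105, 0x00FF))  # prints 1
print(compare_shorts(0x0100, 0x0101))  # prints 0
```

Under these assumptions the flag produced in the second pulse agrees with a direct 16-bit comparison, which is why only byte-wide arithmetic is needed.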

[00121] Referring to Figs. 6b and 6c, the first burst of MS bytes is transferred from the X- 
register during clock pulse number nine (9) through clock pulse number sixteen (16). 
Likewise, referring to Figs. 6c through 6f, bursts of LS bytes are transferred from the X- 
register during clock pulse number eighteen (18) through clock pulse number twenty-three 
(23) and during clock pulse number thirty-three (33) through clock pulse number thirty-six 
(36), whereas a burst of MS bytes is transferred during clock pulse number twenty-five (25) 
through clock pulse number thirty-one (31). 

[00122] Referring to FIGS. 6a-6h, it can be seen that the remaining shorts are loaded and 
moved throughout the odd and even pipelines. It can also be seen that the odd numbered 
shorts are compared to the odd extrema that is conditionally saved in registers Q3 and Q2. 
For example, the LS-byte of short-5 (i.e., LS c4) and the MS-byte of short-5 (i.e., MS c4) are 
compared to the LS-byte odd extrema and to the MS-byte odd extrema, respectively, during 
clock pulses numbered fourteen and fifteen, and the LS-byte of short-7 (i.e., LS c6) and the 
MS-byte of short-7 (i.e., MS c6) are compared to the LS-byte odd extrema and to the MS-byte odd 
extrema, respectively, during clock pulses numbered nineteen and twenty. 

[00123] Likewise, it can be seen that the remaining even numbered shorts are compared to the 
even extrema that is conditionally saved in registers Q1 and Q0. For example, the LS-byte of 
short-6 (i.e., LS c5) and the MS-byte of short-6 (i.e., MS c5) are compared to the LS-byte even 
extrema and to the MS-byte even extrema, respectively, during clock pulses numbered sixteen 
and seventeen, and the LS-byte of short-8 (i.e., LS c7) and the MS-byte of short-8 (i.e., MS c7) 
are compared to the LS-byte even extrema and to the MS-byte even extrema, respectively, 
during clock pulses numbered twenty-one and twenty-two. 
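
The separation of the shorts into an odd pipeline and an even pipeline, each keeping a running extrema, may be sketched in Python as follows. The sketch operates on whole shorts rather than on separate bytes, and the function names `running_extrema` and `pipeline_extrema` are hypothetical. The sample values are the set of shorts given for row-c.

```python
def running_extrema(values, find_low=True):
    """Compare each value against the saved extrema and update it conditionally."""
    best = values[0]
    for candidate in values[1:]:
        better = candidate < best if find_low else candidate > best
        flag = 1 if better else 0
        if flag == 1:          # conditional update of the saved extrema
            best = candidate
    return best


def pipeline_extrema(shorts, find_low=True):
    """Return (odd extrema, even extrema) for shorts numbered from short-1."""
    if len(shorts) < 2:
        raise ValueError("need at least one odd and one even numbered short")
    odd = shorts[0::2]   # short-1, short-3, short-5, ...
    even = shorts[1::2]  # short-2, short-4, short-6, ...
    return running_extrema(odd, find_low), running_extrema(even, find_low)


row_c = [2, 5, 1, 6, 3, 2, 4, 5]
print(pipeline_extrema(row_c))         # prints (1, 2)
print(pipeline_extrema(row_c, False))  # prints (4, 6)
```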

[00124] Referring now to clock pulse number forty-one (41), it can be seen that after the last 
odd numbered LS-byte has been compared to the LS-byte of the odd extrema, the LS-byte of 
the odd extrema is loaded into the second register from the fourth register (among others). In 
the current embodiment, the LS-byte of the odd extrema is loaded from register Q2 to register 
M2, among others. 

[00125] Likewise, during clock pulse number forty-two (42), it can be seen that after the last 
odd numbered MS-byte has been compared to the MS-byte of the odd extrema, the MS-byte 
of the odd extrema is loaded into the seventh register from the ninth register (among others). 
In the current embodiment, the MS-byte of the odd extrema is loaded from register Q3 to 
register M3, among others. 

[00126] During clock pulse number forty-three (43), the LS-byte of the odd extrema is 
compared to the LS-byte of the even extrema, whereas during clock pulse number forty-four (44), 
the MS-byte of the odd extrema is compared to the MS-byte of the even extrema. As previously 
discussed, the value assigned to the carry flag during clock pulses forty-three (43) and 
forty-four (44) is dependent upon the results of the comparison. 

[00127] During clock pulse number forty-five (45), the contents of the second and eighth 
registers are conditionally saved to the fourth and tenth registers, respectively. For example, 
in the current embodiment, if the odd extrema is greater than the even extrema, the contents of 
registers M2 and M1 are loaded into registers Q2 and Q1, respectively. On the contrary, if the 
even extrema is greater than or equal to the odd extrema, the MS-byte and the LS-byte of the 
even extrema remain in registers Q2 and Q1, respectively. In either instance, the values 
within registers Q2 and Q1 after clock pulse number forty-five (45) represent the MS-byte 
and the LS-byte, respectively, of the dimensional extrema for row-c. 
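
The selection between the odd extrema and the even extrema may be sketched in Python as follows for the case described in this paragraph, in which the greater value is kept. Each extrema is represented as an (MS-byte, LS-byte) pair. The function name `select_dimensional_extrema` and the pair representation are assumptions made for illustration.

```python
def select_dimensional_extrema(odd, even):
    """Return the (MS-byte, LS-byte) pair kept as the dimensional extrema.

    Python compares tuples element by element, MS-byte first and LS-byte
    second, which matches a comparison of the two 16-bit values.
    The odd extrema is kept only when it is strictly greater; otherwise
    the even extrema remains.
    """
    return odd if odd > even else even


# Hypothetical byte pairs chosen to exercise each branch.
print(select_dimensional_extrema((0x00, 0x04), (0x00, 0x06)))  # prints (0, 6)
print(select_dimensional_extrema((0x01, 0x00), (0x00, 0xFF)))  # prints (1, 0)
```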

[00128] It should be recognized that the above-described embodiments of the invention are 
intended to be illustrative only. Numerous alternative embodiments may be devised by those 
skilled in the art without departing from the scope of the following claims. 

What is claimed is: 

1. A method for finding an extrema for an n-dimensional array having a plurality of 
processing elements, the method comprising: 

determining within each of said processing elements a first dimensional extrema 
for a first dimension of said n-dimensional array, wherein said first dimensional 
extrema is related to one or more local extrema of said processing elements in said 
first dimension and wherein said first dimensional extrema has a most significant byte 
and a least significant byte; 

determining within each of said processing elements a next dimensional extrema 
for a next dimension of said n-dimensional array, wherein said next dimensional 
extrema is related to one or more of said first dimensional extrema and wherein said 
next dimensional extrema has a most significant byte and a least significant byte; and 

repeating said determining within each of said processing elements a next 
dimensional extrema for each of said n-dimensions, wherein each of said next 
dimensional extrema is related to a dimensional extrema from a previously selected 
dimension. 



2. The method of claim 1 wherein said determining within each of said processing 
elements a first dimensional extrema for a first dimension of said n-dimensional array 
comprises: 

receiving a set of local extrema from one or more of said processing elements 
within said first dimension; 

separating said set of local extrema into an odd numbered set of local extremas 
and an even numbered set of local extremas; 

separating each of said odd numbered local extrema into at least one of an odd 
most significant byte and an odd least significant byte; 

separating each of said even numbered local extrema into at least one of an even 
most significant byte and an even least significant byte; 

determining an odd extrema from said odd numbered set; 

determining an even extrema from said even numbered set; and 

determining said first dimensional extrema for a first dimension from said odd 
extrema and said even extrema. 

3. The method of claim 2 wherein said receiving a set of local extrema from one or more 
of said processing elements within said first dimension comprises: 

receiving a burst of said odd and even least significant bytes; and 
receiving a burst of said odd and even most significant bytes. 

4. The method of claim 3 further comprising: 

selecting a burst length for said burst of odd and even least significant bytes and a 
burst length for said odd and even most significant bytes to minimize the amount of 
lost cycles within said processing elements. 

5. The method of claim 1 wherein said determining within each of said processing 
elements a next dimensional extrema for a next dimension of said n-dimensional array 
comprises: 

receiving a set of said first dimensional extrema from one or more of said 
processing elements within said next dimension; 

separating said set of first dimensional extrema into an odd numbered set and an 
even numbered set; 

separating each of said odd numbered set into at least one of an odd most 
significant byte and an odd least significant byte; 

separating each of said even numbered set into at least one of an even most 
significant byte and an even least significant byte; 

determining an odd extrema from said odd numbered set; 

determining an even extrema from said even numbered set; and 

determining said next dimensional extrema for a next dimension from said odd 
extrema and said even extrema. 

6. The method of claim 5 wherein said receiving a set of said first dimensional extrema 
from one or more of said processing elements within said next dimension comprises: 

receiving a burst of said odd and even least significant bytes; and 
receiving a burst of said odd and even most significant bytes. 

7. The method of claim 6 further comprising: 

selecting a burst length for said burst of odd and even least significant bytes and a 
burst length for said odd and even most significant bytes to minimize the amount of 
lost cycles within said processing elements. 

8. The method of claim 1 wherein said repeating said determining within each of said 
processing elements a next dimensional extrema for each of said n-dimensions comprises: 

receiving a set of dimensional extrema from a previously selected dimension 
from one or more of said processing elements within a currently selected dimension; 

separating said set of dimensional extrema from said previously selected 
dimension into an odd numbered set and an even numbered set; 

separating each of said odd numbered set into at least one of an odd most 
significant byte and an odd least significant byte; 

separating each of said even numbered set into at least one of an even most 
significant byte and an even least significant byte; 

determining an odd extrema from said odd numbered set; 

determining an even extrema from said even numbered set; and 

determining said next dimensional extrema for said next dimension from said odd 
extrema and said even extrema. 

9. The method of claim 2 wherein determining an odd extrema from said odd numbered 
set comprises: 

loading the least significant byte of an odd numbered local extrema into a least 
significant odd byte register; 

loading the most significant byte of said odd numbered local extrema into a most 
significant odd byte register; 

comparing the contents of said least significant odd byte register to the least 
significant byte of another odd numbered local extrema and setting a carry flag 
relative to said comparison; 

comparing the contents of said most significant odd byte register to the most 
significant byte of said another odd numbered local extrema and setting an odd flag 
relative to said comparison; and 

conditionally updating said most significant odd byte register and said least 
significant odd byte register relative to said odd flag. 

10. The method of claim 2 wherein said determining an even extrema from said even 
numbered set comprises: 

loading the least significant byte of an even numbered local extrema into a least 
significant even byte register; 

loading the most significant byte of said even numbered local extrema into a most 
significant even byte register; 

comparing the contents of said least significant even byte register to the least 
significant byte of another even numbered local extrema and setting a carry flag 
relative to said comparison; 

comparing the contents of said most significant even byte register to the most 
significant byte of said another even numbered local extrema and setting an even flag 
relative to said comparison; and 

conditionally updating said most significant even byte register and said least 
significant even byte register relative to said even flag. 

11. The method of claim 9 further comprising repeating said comparing the contents of 
said most significant odd byte register, said comparing the contents of said least significant 
odd byte register, and said conditionally updating said most significant even byte register and 
said least significant even byte register for each of said odd numbered local extrema within 
said set. 

12. The method of claim 10 further comprising repeating said comparing the contents of 
said most significant even byte register, said comparing the contents of said least significant 
even byte register, and said conditionally updating said most significant even byte register and 
said least significant even byte register for each of said even numbered local extrema within 
said set. 

13. The method of claim 2 wherein said determining said next dimensional extrema for a 
next dimension from said odd extrema and said even extrema further comprises: 

loading the least significant byte of said odd extrema into a least significant odd 
byte register and the most significant byte of said odd extrema into a most significant 
odd byte register; 

loading the least significant byte of said even extrema into a least significant even 
byte register and the most significant byte of said even extrema into a most significant 
even byte register; 

comparing the contents of said least significant even byte register to the contents 
of said least significant odd byte register and setting a carry flag relative to a result of 
said comparison; 

comparing the contents of said most significant even byte register to the contents 
of said most significant odd byte register and setting an extrema flag relative to a 
result of said comparison; and 

conditionally updating said most significant even byte register and said least 
significant even byte register relative to said extrema flag. 

14. A method comprising: 

identifying extrema within a data stream as having one of an odd or an even 
position, said extrema having a most significant byte and a least significant byte; 

processing said extrema having an odd position to produce an odd extrema, said 
odd extrema having a most significant byte and a least significant byte; 

processing said extrema having an even position to produce an even extrema, said 
even extrema having a most significant byte and a least significant byte; and 

determining a dimensional extrema from said odd extrema and said even extrema, 
said dimensional extrema having a most significant byte and a least significant byte. 

15. The method of claim 14 wherein said processing said extrema having an odd position 
comprises: 

loading the least significant byte of an odd numbered local extrema into a least 
significant odd byte register; 

loading the most significant byte of said odd numbered local extrema into a most 
significant odd byte register; 

comparing the contents of said least significant odd byte register to the least 
significant byte of another odd numbered local extrema and setting a carry flag 
relative to said comparison; 

comparing the contents of said most significant odd byte register to the most 
significant byte of said another odd numbered local extrema and setting an odd flag 
relative to said comparison; and 

conditionally updating said most significant odd byte register and said least 
significant odd byte register relative to said odd flag. 

16. The method of claim 14 wherein said identifying extrema within a data stream as 
having one of an odd or an even position comprises: 

loading a burst of said least significant bytes of said odd and said even positioned 
extrema, wherein said odd positioned least significant bytes are loaded into a first 
plurality of registers and said even positioned least significant bytes are loaded into a 
second plurality of registers; and 

loading a burst of said most significant bytes of said odd and said even positioned 
extrema, wherein said odd positioned most significant bytes are loaded into said first 
plurality of registers and said even positioned most significant bytes are loaded into 
said second plurality of registers. 

17. The method of claim 14 wherein said processing said extrema having an even 
position comprises: 

loading the least significant byte of an even numbered local extrema into a least 
significant even byte register; 

loading the most significant byte of said even numbered local extrema into a most 
significant even byte register; 

comparing the contents of said least significant even byte register to the least 
significant byte of another even numbered local extrema and setting a carry flag 
relative to said comparison; 

comparing the contents of said most significant even byte register to the most 
significant byte of said another even numbered local extrema and setting an even flag 
relative to said comparison; and 

conditionally updating said most significant even byte register and said least 
significant even byte register relative to said even flag. 

18. The method of claim 15 further comprising repeating said comparing the contents of 
said most significant odd byte register, said comparing the contents of said least significant 
odd byte register, and said conditionally updating said most significant even byte register and 
said least significant even byte register for each of said odd numbered local extrema within 
said set. 

19. The method of claim 16 further comprising repeating said comparing the contents of 
said most significant even byte register, said comparing the contents of said least significant 
even byte register, and said conditionally updating said most significant even byte register and 
said least significant even byte register for each of said even numbered local extrema within 
said set. 

20. The method of claim 14 further comprising: 

determining a next dimensional extrema for a next dimension for a n-dimensional 
array, wherein said next dimensional extrema is related to said dimensional extrema, 
said next dimensional extrema having a most significant byte and a least significant 
byte; and 

repeating said determining a next dimensional extrema for each of said n- 
dimensions, wherein each of said next dimensional extrema is related to a 
dimensional extrema from a previously selected dimension. 

21. A method for determining a dimensional extrema for an n-dimensional array of 
processing elements, comprising: 

loading odd numbered extrema from a set of said processing elements in a first 
dimension into a first plurality of registers; 

loading even numbered extrema from a set of set processing elements into a 
second plurality of registers; 

comparing certain of said loaded odd numbered extrema to produce an odd 
extrema, said odd extrema having a most significant byte and a least significant byte; 

comparing certain of said loaded even numbered extrema to produce an even 
extrema, said even extrema having a most significant byte and a least significant byte; 
and 

producing a dimensional extrema in response to said odd extrema and said even 
extrema, said dimensional extrema having a most significant byte and a least 
significant byte. 

22. The method of claim 21 wherein said loading odd numbered extrema from a set of 
said processing elements in a first dimension into a first plurality of registers comprises: 

loading said least significant byte of an extrema having an odd position into a 
first register; 

transferring said least significant byte of said extrema in said first register into a 
second register and loading said most significant byte of said extrema into said first 
register; 

transferring said most significant byte of said extrema in said first register into a 
third register and loading said least significant byte of another extrema having an odd 
position into said first register; and 

transferring said least significant byte of said another extrema in said first register 
into a fourth register and loading said most significant byte of said another extrema 
into said first register. 

23. The method of claim 21 wherein said loading even numbered extrema from a set of 
set processing elements into a second plurality of registers comprises: 

loading said least significant byte of an extrema having an even position into a 
first register; 

transferring said least significant byte of said extrema in said first register into a 
second register and loading said most significant byte of said extrema into said first 
register; 

transferring said most significant byte of said extrema in said first register into a 
third register and loading said least significant byte of another extrema having an 
even position into said first register; and 

transferring said least significant byte of said another extrema in said first register 
into a fourth register and loading said most significant byte of said another extrema 
into said first register. 

24. The method of claim 22 wherein said comparing certain of said loaded odd numbered 
extrema to produce an odd extrema comprises: 

comparing said least significant byte of said extrema in said second register to 
said least significant byte of said another extrema in said fourth register; and 

comparing said most significant byte of said extrema in said third register to said 
most significant byte of said another extrema in said first register. 

25. The method of claim 23 wherein said comparing certain of said loaded even 
numbered extrema to produce an even extrema comprises: 

comparing said least significant byte of said extrema in said second register to 
said least significant byte of said another extrema in said fourth register; and 

comparing said most significant byte of said extrema in said third register to said 
most significant byte of said another extrema in said first register. 

26. The method of claim 24 wherein said comparing certain of said loaded odd 
numbered extrema to produce an odd extrema further comprises: 

updating said second and third registers with said odd extrema; 
loading said least significant byte of a next extrema having an odd position into 
said first register; 

transferring said least significant byte of said next extrema in said first register 
into a fourth register and loading said most significant byte of said next extrema into 
said first register; and 

repeating said comparing steps, said updating step, and said loading and 
transferring steps for each remaining extrema in an odd position within said data 
stream. 



27. The method of claim 25 wherein said comparing certain of said loaded even 
numbered extrema to produce an even extrema further comprises: 

updating said second and third registers with said even extrema; and 
loading said least significant byte of a next extrema having an even position into 
said first register; 

transferring said least significant byte of said next extrema in said first register 
into a fourth register and loading said most significant byte of said next extrema into 
said first register; and 

repeating said comparing steps, said updating step, and said loading and 
transferring steps for each remaining extrema in an even position within said data 
stream. 



28. The method of claim 21 further comprising: 

determining a next dimensional extrema for a next dimension of said n-dimensional 
array, wherein said next dimensional extrema is related to said 
dimensional extrema, said next dimensional extrema having a most significant byte 
and a least significant byte; and 

repeating said determining a next dimensional extrema for each of said n- 
dimensions, wherein each of said next dimensional extrema is related to a 
dimensional extrema from a previously selected dimension. 



29. The method of claim 21 wherein said loading odd numbered extrema from a set of 
said processing elements in a first dimension into a first plurality of registers and said loading 
even numbered extrema from a set of set processing elements into a second plurality of 
registers comprises: 

loading a burst of said least significant bytes of said odd and said even numbered 
extrema, wherein said odd numbered least significant bytes are loaded into said first 
plurality of registers and said even numbered least significant bytes are loaded into 
said second plurality of registers; and 

loading a burst of said most significant bytes of said odd and said even numbered 
extrema, wherein said odd numbered most significant bytes are loaded into said first 
plurality of registers and said even numbered most significant bytes are loaded into 
said second plurality of registers. 

30. A memory device carrying a set of instructions which, when executed, perform a 
method comprising: 

determining within each of said processing elements a first dimensional extrema 
for a first dimension of said n-dimensional array, wherein said first dimensional 
extrema is related to one or more local extrema of said processing elements in said 
first dimension and wherein said first dimensional extrema has a most significant byte 
and a least significant byte; 

determining within each of said processing elements a next dimensional extrema 
for a next dimension of said n-dimensional array, wherein said next dimensional 
extrema is related to one or more of said first dimensional extrema and wherein said 
next dimensional extrema has a most significant byte and a least significant byte; and 

repeating said determining within each of said processing elements a next 
dimensional extrema for each of said n-dimensions, wherein each of said next 
dimensional extrema is related to a dimensional extrema from a previously selected 
dimension. 






ABSTRACT OF THE DISCLOSURE 
A method for finding an extrema for an n-dimensional array having a plurality of 
processing elements comprises determining within each processing element a 
first dimensional extrema for a first dimension, wherein the first dimensional extrema is 
related to the local extrema of the processing elements in the first dimension and wherein the 
first dimensional extrema has a most significant byte and a least significant byte, determining 
within each processing element a next dimensional extrema for a next dimension of the n- 
dimensional array, wherein the next dimensional extrema is related to the first dimensional 
extrema and wherein the next dimensional extrema has a most significant byte and a least 
significant byte; and repeating the determining within each processing element a next 
dimensional extrema for each of the n-dimensions, wherein each of the next dimensional 
extrema is related to a dimensional extrema from a previously selected dimension. 
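
The dimension-by-dimension procedure summarized above may be sketched in Python as follows. The sketch is a software model only: it represents the array as a dictionary keyed by coordinate tuples, treats each extrema as a whole value rather than as separate bytes, and uses the hypothetical function name `array_extrema`. None of these choices is taken from the disclosure.

```python
from itertools import product


def array_extrema(local, shape, pick=max):
    """Return a dict giving, for every processing element, the array-wide extrema.

    local: dict mapping each coordinate tuple to that element's local extrema
    shape: number of processing elements along each of the n dimensions
    pick:  max for the high extrema, min for the low extrema
    """
    current = dict(local)
    for dim in range(len(shape)):
        updated = {}
        for coord in product(*(range(n) for n in shape)):
            # Every element on the line through coord along dimension dim.
            line = [coord[:dim] + (i,) + coord[dim + 1:] for i in range(shape[dim])]
            # The dimensional extrema is built from the previous dimension's results.
            updated[coord] = pick(current[c] for c in line)
        current = updated
    return current


# Hypothetical 2 x 3 array of local extrema.
local = {(0, 0): 2, (0, 1): 5, (0, 2): 1, (1, 0): 6, (1, 1): 3, (1, 2): 4}
result = array_extrema(local, (2, 3))
print(set(result.values()))  # prints {6}
```

After the last dimension has been processed, every element in the model holds the same value, which is the extrema of the whole array.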






[Drawing sheets. Only the labels listed below are recoverable from the extracted text.]

FIG. 1 (sheet 1/14). Labels: Master Clock; Clock Gen.; Bus Interface (intr, flag, cs, w, r, Addr, Data; rdy, Cmd, Addr, Data); FIFO; Task Dispatch Unit (TDU); Host Memory Interface (HMI); Array Control Sequencer; DRAM Control Unit (DCU); Processing Element Array; RF; DRAM Module.

FIG. 2 (sheet 2/14). Labels: Q-Shift Bus; M-Shift Bus; Q-Registers; M-Registers; ALU; To Host; Host Memory Access Port 46; DRAM Interface 44; To DRAM Module; Condition Logic 38; H-Registers; Register File 42; Result Pipe R0-R2, X.

FIG. 3 (sheet 3/14). Labels: Q3, Q1, Q0; MMP3, MMP2, MMP1; M3, M2, M1, M0; R0, R1, R2; MA; Logic Unit; Z, N, C; Register File 42; XN, XE, XS, XW; X_OUT.

FIG. 4 (sheet 4/14). Box text: Place Local Extrema onto Transfer Network; Determine Line Extrema For Each Line of PEs in a First Dimension; Determine Line Extrema For Each Line of PEs in a Next Dimension.

FIG. 5 (sheet 5/14). Box text: Load local extremas from other PEs within line into each PE within the line; Separate odd numbered local extrema into an odd pipeline and even numbered local extrema into an even pipeline; Determine odd extrema and even extrema; Determine dimensional extrema from the odd extrema and the even extrema.

Sheets 6/14 through 11/14 and FIG. 7: no text recoverable.



